Author: Hang He

GitHub Page: https://GavinHHE.github.io.

Class: CMPS 3160

Dataset Source:

From CDC:

PLACES: Local Data for Better Health, Census Tract Data 2020 release: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh

From Kaggle:

Cardio Vascular Disease Detection: https://www.kaggle.com/bhadaneeraj/cardio-vascular-disease-detection

Diabetes Health Indicators Dataset: https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset

Code Reference: https://plotly.com/python/choropleth-maps/

Project Plan & Project goals:

  1. Introduction

  2. ETL and EDA

  3. Model Construction and Evaluation

  4. Conclusion

Introduction:

Background and Project Goals

An article from heart.org states that nearly half of American adults have high blood pressure. Most of the time, high blood pressure (HBP, or hypertension) has no obvious symptoms to indicate that something is wrong. It develops slowly over time and can be related to many causes. Articles from the CDC also state that only about 1 in 4 adults (24%) with hypertension have their condition under control. According to the CDC, heart disease (cardiovascular disease), cancer and diabetes are currently among the most expensive health conditions in the United States. I believe that it is meaningful to study whether High Blood Pressure is positively related to two of these diseases: Cardiovascular Disease and Diabetes.

For the project, I will take a deep dive into the health data from the CDC and the Cardiovascular Disease and Diabetes data from Kaggle through visualization and analysis. The final report will include a visualization of the percentage of the population that has HBP by state and an analysis of the importance of HBP as a risk factor for Cardiovascular Disease and Diabetes. I will also include machine learning models to make predictions using the available data. Hopefully, the models will be able to predict accurately whether a specific person has Cardiovascular Disease or Diabetes.

The main question for my project is "How risky is HBP? Is High Blood Pressure positively related to either Cardiovascular Disease or Diabetes?". In addition to that, I would also like to analyze other important risk factors that are related to both diseases.

About the dataset

All data can also be found in the repository.

Census Tract Data 2020 release (2017 to 2018) contains the overall responses to surveys conducted by multiple organizations. Columns in the dataset include when and where the survey was conducted, the total population involved, a description of the question asked, and the response value as a percentage. I will use this dataset to visualize the HBP rate and perform some basic calculations.

Cardio Vascular Disease Detection contains data on people both with and without Cardiovascular Disease. It includes personal information such as age, gender, height, weight and blood pressure measurements. The dataset also has columns indicating smoking, drinking and exercise status. I will also assess those three risk factors in the machine learning part.

Diabetes Health Indicators Dataset contains data on people both with and without Diabetes. Columns include whether the person smokes or drinks, age group, education level, income level, gender, etc. There are no missing values in the dataset. Many of the columns are categorical or boolean variables.

None of the three datasets has missing values. Data from Census Tract Data 2020 release are clean and do not need further cleaning. Cardio Vascular Disease Detection contains many extreme values that are unrealistic. I will deal with those values by dropping or transforming them. In the Diabetes Health Indicators Dataset, the most extreme values are dropped in the EDA part. The remaining outliers are left in place there, and I will transform or drop them when making predictions.

ETL and EDA:

Here I will load the Census Tract Data 2020 release. There is a description of the columns on the website: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh. Since I am only interested in the survey results about blood pressure, I will slice the data and keep the columns: ['Year','StateAbbr','StateDesc','CountyName','Measure','Data_Value_Unit','Data_Value','Geolocation']
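The slicing step can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: the helper name `select_blood_pressure_rows`, the text match on "blood pressure" and the sample rows are assumptions made for the example, and the real data would come from the downloaded CSV instead of the in-memory frame.

```python
import pandas as pd

# Columns named in the text above.
KEEP_COLUMNS = ["Year", "StateAbbr", "StateDesc", "CountyName",
                "Measure", "Data_Value_Unit", "Data_Value", "Geolocation"]


def select_blood_pressure_rows(places: pd.DataFrame) -> pd.DataFrame:
    """Keep the columns of interest and the rows whose Measure mentions blood pressure."""
    subset = places[KEEP_COLUMNS]
    is_bp = subset["Measure"].str.contains("blood pressure", case=False, na=False)
    return subset[is_bp].reset_index(drop=True)


# Made-up sample standing in for the real file (values are NOT from the CDC data).
sample = pd.DataFrame({
    "Year": [2017, 2017, 2018],
    "StateAbbr": ["LA", "LA", "TX"],
    "StateDesc": ["Louisiana", "Louisiana", "Texas"],
    "CountyName": ["Orleans", "Orleans", "Harris"],
    "Measure": ["High blood pressure among adults aged >=18 years",
                "Current asthma among adults aged >=18 years",
                "High blood pressure among adults aged >=18 years"],
    "Data_Value_Unit": ["%", "%", "%"],
    "Data_Value": [35.1, 9.8, 30.4],
    "Geolocation": ["POINT (-90.07 29.95)", "POINT (-90.07 29.95)",
                    "POINT (-95.36 29.76)"],
    "TotalPopulation": [1200, 1200, 3400],
})

bp = select_blood_pressure_rows(sample)
# Average reported HBP percentage per state, the quantity a state-level map would show.
state_mean = bp.groupby("StateAbbr")["Data_Value"].mean()
```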

Load Cardio Vascular Disease Detection data

Features:

Age | Objective Feature | age | int (days) |

Height | Objective Feature | height | int (cm) |

Weight | Objective Feature | weight | float (kg) |

Gender | Objective Feature | gender | categorical code |

Systolic blood pressure | Examination Feature | ap_hi | int |

Diastolic blood pressure | Examination Feature | ap_lo | int |

Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |

Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |

Smoking | Subjective Feature | smoke | binary |

Alcohol intake | Subjective Feature | alco | binary |

Physical activity | Subjective Feature | active | binary |

Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

Remove extreme values

There is no explanation of the meaning of 1 and 2 in the gender column. After comparing the mean weight and height of the two groups, I concluded that gender value 2 is male and gender value 1 is female.

According to the documentation, age is stored in days.
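A minimal sketch of both checks, using made-up rows: it compares mean height and weight per gender code and converts age from days to years. The column name `age_years` and the divisor 365.25 are assumptions for illustration; the notebook may use a different name or divide by 365.

```python
import pandas as pd

# Made-up rows in the shape of the cardio dataset (age in days, height in cm, weight in kg).
cardio = pd.DataFrame({
    "age": [18393, 20228, 18857, 17623],
    "gender": [2, 1, 1, 2],
    "height": [168, 156, 165, 169],
    "weight": [62.0, 85.0, 64.0, 82.0],
})

# Convert age from days to years (365.25 days per year is an assumption).
cardio["age_years"] = (cardio["age"] / 365.25).round(1)

# Compare average body measurements per gender code to guess which code is which.
by_gender = cardio.groupby("gender")[["height", "weight"]].mean()
```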

Looking at the box plots of height, weight, ap_hi and ap_lo, I see many extreme values, some of which are unrealistic. I will remove the unrealistic values. The remaining outliers could be meaningful here, so I will transform them later.

According to https://en.wikipedia.org/wiki/List_of_the_verified_shortest_people, the shortest height recorded is 54.6 cm. I will remove rows that have a height lower than 55 cm.

I observed many outliers in both ap_hi and ap_lo. According to https://pubmed.ncbi.nlm.nih.gov/7741618/, the highest blood pressure recorded is 370/360 mmHg, so I will remove rows in which ap_hi or ap_lo is higher than 360.

It is unrealistic for a living person to have a diastolic pressure equal to or greater than the systolic pressure.

It is unrealistic for a living person to have diastolic and systolic pressures lower than 50.
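The removal rules above can be combined into a single boolean mask. This is a sketch under stated assumptions: the function name `remove_unrealistic_rows` is hypothetical, the sample rows are made up, and the last rule is read as "drop a row if either pressure is below 50", which is one possible interpretation of the text.

```python
import pandas as pd


def remove_unrealistic_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that violate the plausibility rules described in the text."""
    keep = (
        (df["height"] >= 55)                        # not shorter than the shortest verified person
        & (df["ap_hi"] <= 360) & (df["ap_lo"] <= 360)  # not above the highest recorded pressure
        & (df["ap_lo"] < df["ap_hi"])               # diastolic must be below systolic
        & (df["ap_hi"] >= 50) & (df["ap_lo"] >= 50)   # assumed reading: either value below 50 is dropped
    )
    return df[keep].reset_index(drop=True)


# Made-up rows: only the first and the last satisfy every rule.
raw = pd.DataFrame({
    "height": [168, 50, 170, 160, 165, 172],
    "ap_hi": [120, 120, 14020, 80, 40, 130],
    "ap_lo": [80, 80, 90, 120, 30, 85],
})
clean = remove_unrealistic_rows(raw)
```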

According to the documentation, gender, cholesterol, gluc, smoke, alco, active, and cardio are categorical variables.

The values 1, 2 and 3 in the cholesterol and gluc columns are hard to interpret, so I mapped both columns to labels according to the data specification provided.

Looking at the distributions of those 5 columns, weight and height seem to follow a normal distribution. There are some outliers in weight and height, but the number of outliers is too small to be visible on the graph. Age is positively skewed. It is hard to tell the distribution of ap_lo and ap_hi.

Here, I compared the mean values of those 5 columns. The graphs show that people with cardiovascular disease are slightly older and have higher blood pressure measurements.

I visualized the number of smokers and drinkers among people with cardiovascular disease. From the graph, smoking and drinking appear to have no significant impact on cardiovascular disease.

I also created a heatmap to show the correlation between variables. The disease indicator, cardio, has a high correlation with ap_hi and ap_lo. Age and weight are also correlated with cardio.
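The mapping can be sketched with `Series.map`, using the labels from the feature table above. The variable name `LEVELS` and the sample values are illustrative only.

```python
import pandas as pd

# Labels taken from the feature table: 1 normal, 2 above normal, 3 well above normal.
LEVELS = {1: "normal", 2: "above normal", 3: "well above normal"}

# Made-up codes standing in for the real columns.
df = pd.DataFrame({"cholesterol": [1, 3, 2], "gluc": [1, 1, 3]})

for col in ["cholesterol", "gluc"]:
    # Replace each numeric code by its label and store the column as categorical.
    df[col] = df[col].map(LEVELS).astype("category")
```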

Load diabetes dataset

Convert the data types according to the column descriptions provided by the data uploader

Since most of the columns are either categorical or boolean variables, I will check the distribution of BMI only

People rarely have a BMI greater than 50, so I will drop rows with a BMI higher than 50.

Only a few outliers remain, so their potential effect should be small.

The BMI data seem to follow a normal distribution. From another graph, I observed that people with Diabetes tend to have a higher BMI.

Among people who smoke, there is a higher chance of already having been diagnosed with Diabetes. Among people who have high blood pressure, that chance is significantly higher.

I observed that the correlations between Diabetes_binary and HighBP and BMI are relatively high. There are also high correlations between PhysHlth and DiffWalk and between MentHlth and PhysHlth. I will remove PhysHlth when applying the classification model.
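One way to quantify such a comparison is the share of diagnosed people within each group. The sketch below uses made-up rows, so the numbers illustrate the calculation only and say nothing about the real dataset; the helper name `diagnosis_rate` is hypothetical.

```python
import pandas as pd


def diagnosis_rate(df: pd.DataFrame, factor: str) -> pd.Series:
    """Share of rows with Diabetes_binary == 1 for each value of a binary factor."""
    return df.groupby(factor)["Diabetes_binary"].mean()


# Made-up rows (NOT the Kaggle data).
diabetes = pd.DataFrame({
    "HighBP": [1, 1, 1, 1, 0, 0, 0, 0],
    "Diabetes_binary": [1, 1, 1, 0, 0, 1, 0, 0],
})

rate_by_bp = diagnosis_rate(diabetes, "HighBP")
```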
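The pairs that stand out in a heatmap can also be listed directly from the correlation matrix. This is a sketch on synthetic data: the helper name `highly_correlated_pairs` and the 0.5 threshold are assumptions, and the generated columns only borrow the dataset's column names for readability.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.5):
    """Return (column_a, column_b, r) for every pair with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs


# Synthetic data: DiffWalk is built from PhysHlth, Unrelated is independent noise.
rng = np.random.default_rng(0)
phys = rng.normal(size=500)
synthetic = pd.DataFrame({
    "PhysHlth": phys,
    "DiffWalk": phys + 0.3 * rng.normal(size=500),
    "Unrelated": rng.normal(size=500),
})
pairs = highly_correlated_pairs(synthetic)
```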

Model Construction and Evaluation

Hypothesis Test

1. High Blood Pressure is positively correlated to both Cardiovascular Disease and Diabetes.

2. Hypertension is the main cause of both Cardiovascular Disease and Diabetes.

With the machine learning models, I will be able to determine whether High Blood Pressure is positively related to Cardiovascular Disease and Diabetes. The risk associated with the other factors can also be measured by the coefficients of the fitted models.

Model Construction

For both datasets, I can use Logistic Regression or Random Forest to make meaningful predictions and find the importance of HBP as a risk factor. Both datasets are labeled, and each target has only two possible values (0 or 1). I will start with Logistic Regression since it is well suited to classification and its coefficients are more interpretable.

For Cardio Vascular Disease Detection, I will incorporate BMI as a new feature because I think BMI is a better indicator of a person's overall health than the combination of height and weight. I will also transform BMI and age into categorical variables. I believe that each category of BMI and age is more meaningful than the raw numbers.

Categorical variables in the Cardio Vascular Disease Detection data will be transformed into dummy variables. For both the Cardio Vascular Disease Detection and Diabetes data, I will standardize the data, although no column contains extremely large or small values. There are still many outliers in the Cardio Vascular Disease dataset after I removed the extreme or unrealistic values. I will replace the remaining outliers with boundary values calculated using 2.0 x Interquartile Range.

The target variable for the Cardio Vascular Disease Detection data is cardio, which indicates whether a person has Cardiovascular Disease or not. The independent variables will be the rest of the columns, including age_range, gender, etc. Ap_lo is highly correlated with Ap_hi (0.74 from the heat map), so I will remove Ap_lo from further analysis. For the diabetes dataset, the target variable is Diabetes_binary, which indicates whether a person has Diabetes or not. The independent variables will be the rest of the columns, excluding PhysHlth since it is highly correlated with MentHlth and DiffWalk. I will also remove AnyHealthcare and NoDocbcCost because I do not think either column is related to Diabetes.

Both datasets are balanced, so there is no need to resample the data.

As I mentioned, there are many outliers in the dataset. BMI, which is calculated from height and weight, also contains outliers. If I just drop those outliers, I might lose some important information. Therefore, I will tolerate outliers to some extent. I will replace the outliers with boundary values calculated using 2.0 x Interquartile Range, which I think are more acceptable. Since height and weight will be replaced by BMI and Ap_lo will be removed because of its high correlation, I will transform BMI and ap_hi only.

Before I run the LogisticRegression model using Sklearn, I analyze the significance (p-value < 0.05) of each variable using statsmodels. I observed that the features "gluc_above normal", "age_range_50, 60" and "BMI_range_Below 18" are not significant.
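A minimal sketch of the feature construction, assuming height in cm and weight in kg as in the feature table. The bin edges are inferred from the category names that appear later in the report ("Below 18", "18, 25", "25, 30", "Above 30", "30, 40", "50, 60", "Above 60"); the labels "Below 30" and "40, 50", the half-open intervals and the `age_years` column are assumptions made for the example.

```python
import pandas as pd

# Made-up rows (height in cm, weight in kg, age already converted to years).
df = pd.DataFrame({
    "height": [168, 156, 180, 150],
    "weight": [62.0, 85.0, 55.0, 60.0],
    "age_years": [35.2, 48.0, 55.9, 63.4],
})

# BMI = weight (kg) / height (m) squared.
df["BMI"] = df["weight"] / (df["height"] / 100) ** 2

# Assumed bin edges; right=False makes each interval include its lower edge.
bmi_bins = [0, 18, 25, 30, float("inf")]
bmi_labels = ["Below 18", "18, 25", "25, 30", "Above 30"]
df["BMI_range"] = pd.cut(df["BMI"], bins=bmi_bins, labels=bmi_labels, right=False)

age_bins = [0, 30, 40, 50, 60, float("inf")]
age_labels = ["Below 30", "30, 40", "40, 50", "50, 60", "Above 60"]
df["age_range"] = pd.cut(df["age_years"], bins=age_bins, labels=age_labels, right=False)
```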
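The dummy encoding and standardization can be sketched as below, on made-up rows. Dropping the "normal" level as the baseline is an assumption, inferred from the feature names reported in the conclusion; the `ap_hi_std` name mirrors the one used there, and a z-score with population standard deviation is one common choice of standardization, not necessarily the notebook's.

```python
import pandas as pd

# Made-up rows standing in for the cleaned cardio data.
df = pd.DataFrame({
    "cholesterol": ["normal", "well above normal", "above normal", "normal"],
    "ap_hi": [120.0, 140.0, 130.0, 110.0],
})

# One indicator column per cholesterol level, then drop "normal" as the baseline (assumed).
encoded = pd.get_dummies(df, columns=["cholesterol"])
encoded = encoded.drop(columns=["cholesterol_normal"])

# Z-score standardization: subtract the mean, divide by the standard deviation.
encoded["ap_hi_std"] = (df["ap_hi"] - df["ap_hi"].mean()) / df["ap_hi"].std(ddof=0)
```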
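The replacement step can be sketched as a clipping operation. The boundaries Q1 - 2.0 x IQR and Q3 + 2.0 x IQR are my reading of "boundary values calculated using 2.0 x Interquartile Range"; the function name `cap_outliers_iqr` and the sample values are illustrative.

```python
import pandas as pd


def cap_outliers_iqr(values: pd.Series, k: float = 2.0) -> pd.Series:
    """Clip values to the range [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up systolic readings with one outlier (300).
ap_hi = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 300.0])
capped = cap_outliers_iqr(ap_hi)
# Here Q1 = 112.5 and Q3 = 137.5, so IQR = 25 and the upper boundary is 187.5.
```

Values inside the boundaries are left unchanged, so no rows are lost, which matches the stated goal of tolerating outliers instead of dropping them.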

How to interpret: https://towardsdatascience.com/interpreting-coefficients-in-linear-and-logistic-regression-6ddf1295f6f1

As the variable "cholesterol_well above normal" increases by one unit (from 0 to 1), the odds that the person is in the target class ("1") become over 3x as large as the odds for an otherwise identical person with the variable at 0. On the other hand, as "age_range_30, 40" increases by one unit (from 0 to 1), the odds of being in the target class are multiplied by 0.44; equivalently, the odds with the variable at 0 are 1/0.44 or 2.27x as large as the odds with the variable at 1.

Next, I will run Logistic Regression on the diabetes dataset

Next, I will remove "BMI" and "BMI_range_18, 25" to avoid multicollinearity. PhysHlth will also be removed because of its high correlation with DiffWalk according to the heat map in the EDA section.

Again, I will test the significance of each feature before I run the Logistic Regression model on the diabetes dataset. Every feature in the diabetes dataset is significant (p-value < 0.05), so I do not need to remove any columns.
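The arithmetic behind odds ratios can be made concrete with a small sketch. A logistic regression coefficient b corresponds to an odds ratio exp(b), which multiplies the odds of the target class when the feature goes up by one unit with everything else held fixed. The helper names and the 50% baseline probability are made up for illustration; the odds ratios 3.0 and 0.44 echo the values discussed in the text.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


def probability_after(p_before: float, ratio: float) -> float:
    """Probability of the target class after multiplying the odds by `ratio`."""
    odds_before = p_before / (1.0 - p_before)
    odds_after = odds_before * ratio
    return odds_after / (1.0 + odds_after)


# With a baseline probability of 50% (odds 1:1), an odds ratio of 3 gives odds 3:1, i.e. 75%.
p_high_chol = probability_after(0.5, 3.0)

# An odds ratio of 0.44 lowers the odds; the reciprocal expresses the same effect the other way.
p_age_30_40 = probability_after(0.5, 0.44)
reciprocal = 1 / 0.44
```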
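The fitting and evaluation step can be sketched with scikit-learn on synthetic data. Everything about the data is an assumption made for the example: the two features borrow the names HighBP and Smoker, the target is generated so that only HighBP matters, and the split ratio and seeds are arbitrary. The resulting numbers do not reproduce the accuracies reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary features (NOT the Kaggle data).
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({
    "HighBP": rng.integers(0, 2, size=n),
    "Smoker": rng.integers(0, 2, size=n),
})

# Synthetic target: only HighBP shifts the log-odds (true coefficient 1.2).
log_odds = -0.6 + 1.2 * X["HighBP"]
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Hold out 25% of the rows to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Exponentiated coefficients are the odds ratios used to rank risk factors.
odds = pd.Series(np.exp(model.coef_[0]), index=X.columns)
```

Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so its coefficients can differ slightly from an unregularized fit such as the one statsmodels produces.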

Conclusion & Summary

Logistic Regression Model Accuracy:

For Cardio Vascular Disease Detection data: 73.04%
For Diabetes Health Indicators Dataset: 74.87%

For both models, I think accuracies of 73.04% and 74.87% are not strong enough to use these models to predict whether a person has Cardiovascular Disease or Diabetes in real life. However, the models can still help us identify risk factors among the features in the datasets. To test my first hypothesis, we can focus on the odds ratios in the coefficient tables. The results support the hypothesis that High Blood Pressure is positively correlated with both Cardiovascular Disease and Diabetes. It is clear that hypertension is positively related to Cardiovascular Disease: from the coefficient table, as the variable ap_hi increases by one unit, the odds that a person has Cardiovascular Disease are multiplied by over 2.7, holding the other features fixed. For the Diabetes Health Indicators Dataset, as the variable HighBP increases by one unit (from 0 to 1), the odds that a person is in the target class (Diabetes_binary = 1) are multiplied by over 2.1.

From the coefficient tables, I do not have evidence to support my second hypothesis that Hypertension is the top 1 cause of both Cardiovascular Disease and Diabetes. For Cardiovascular Disease, based on the features in the dataset, "cholesterol_well above normal", "ap_hi_std", "age_range_Above 60", "cholesterol_above normal" and "BMI_range_Above 30" are the top 5 risk factors. Cholesterol level is the most important indicator. For the Diabetes Health Indicators Dataset, based on the model result, "CholCheck", "BMI_range_Above 30", "HighBP", "HighChol" and "BMI_range_25, 30" are the top 5 risk factors. BMI and CholCheck are more important than HighBP.

In conclusion, although High Blood Pressure is not the top 1 cause of either Cardiovascular Disease or Diabetes, it is still highly correlated with both. Therefore, avoiding high blood pressure can still help you stay healthy and avoid high medical expenses in the future. In addition, regular cholesterol checks and a healthy BMI are also important for avoiding Cardiovascular Disease and Diabetes.

Some interesting findings:

The coefficients of smoking and alcohol drinking are relatively small.

The highest odds ratio I observed among them is 1.005192, for Smoker in the Diabetes regression analysis. Within these models, smoking or drinking does not appear to make people more susceptible to cardiovascular disease or diabetes.

Men are more likely to have Cardiovascular Disease or Diabetes.

The odds ratio of gender_Female (1 means female) from the Cardiovascular dataset is 0.951921, and the odds ratio of Sex (1 means male) from the diabetes dataset is 1.300673.

Future study and research:

Disease analysis is a trending topic nowadays. By incorporating machine learning or AI, hospitals can better serve patients and support doctors in medical diagnosis. Although my project mainly focuses on helping us understand the correlation between hypertension and Cardiovascular Disease and Diabetes, the machine learning models could be used to support medical diagnosis if I can further improve the accuracy of their predictions. To achieve that goal, I can first tune the parameters used during model construction or incorporate other models. The second way to improve the models is to include more features.

Here is an interesting article discussing machine learning in healthcare: https://healthitanalytics.com/features/how-machine-learning-is-transforming-clinical-decision-support-tools

I have to mention that my research and analysis of the features from the two datasets did not prove any causal relationship between the features and the target class (Diabetes_binary == 1 or cardio == 1). The report only shows what I observed in the two datasets based on the analysis methods I chose.